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Building Composable Data 
Microservices with Apache Arrow 
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Who are we? (Bloomberg) 


● 8,000+ software engineers

● Product areas:
  ○ Data
  ○ Analytics
  ○ News
  ○ Communication
  ○ Electronic trading

● Lots of teams who need to process large datasets and publish analytics
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Exponential Growth in Market Data Ingestion 


[Chart: market data messages ingested per day over time, with axis marks at 100, 200, and 300 billion messages per day. Annotated events: 2008 Financial Crisis, Flash Crash, Japan Earthquake & Tsunami, US Debt Ceiling Crisis (S&P Downgrade), Stimulus Tapering Discussion.]
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Who are we? (Bloomberg Indices) 


1. Ingest bond data
   Daily bond data is acquired from various internal and external data sources into our data stores, after which it is cleaned, massaged, and enriched.

2. Compute indices
   For each index, identify the members and their relative weights in the index, and compute index-level statistics.

3. Validate results
   Product/Data/Operations validates any potentially anomalous data points.

4. Publish to clients
   Once the data has been validated, the indices are published to the respective clients.
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Who are we? (Bloomberg Indices) 


● Fixed Income Indices
  ○ $50 trillion market value
  ○ Batch processing
  ○ Tight SLAs: clients are expecting reports ASAP

● Data scale
  ○ 200K+ bonds (rows), 1000s of columns of data per day
  ○ 25K+ indices produced each day
  ○ Up to 60-70K constituents (rows) per index
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Why do we care about 
microservices? 
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Where did we come from? 


Monolith: Pros & Cons 


Pros
● Performance
● Programming practices
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Cons
● Coupling
  ○ Impact radius
  ○ Ownership: who owns the monolith?
  ○ Release cadence
  ○ Choice of languages
  ○ Build processes
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Splitting things out into services 


service_d 


Monolith 
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Going all the way 


Pros
● Decoupled
  ○ Impact radius
  ○ Ownership
  ○ Release cadences
  ○ Choice of language
  ○ Build processes


service_b 


service_d 
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service_b 


service_d 
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New complexities... 


Cons
● Performance
● Development costs
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Problem: CPU encoding and decoding 


● CPU time spent encoding/decoding data for large requests/responses (~64MB)
  ○ C++ to C++: 50%
  ○ C++ to Python: 90%


[Diagram: request and response structures are encoded from one service's in-memory data model into a wire format, then decoded into a different in-memory data model on the other side.]
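To make the encode/decode cost concrete, here is a minimal sketch of the round trip. The slides do not name the wire format, so JSON stands in for it here purely as an assumption. The `columns` data, `encode`, and `decode` are hypothetical names for illustration.

```python
import json
import time

# Hypothetical columnar in-memory data model: one list per column.
columns = {
    "SecId": [f"BBG{i:06d}" for i in range(50_000)],
    "Price": [float(i % 100) for i in range(50_000)],
}

def encode(cols):
    """Sender side: pivot columns into row records, then serialize to bytes."""
    names = list(cols)
    rows = [dict(zip(names, values)) for values in zip(*cols.values())]
    return json.dumps(rows).encode("utf-8")

def decode(payload):
    """Receiver side: parse the bytes, then pivot rows back into columns."""
    rows = json.loads(payload.decode("utf-8"))
    names = list(rows[0]) if rows else []
    return {name: [row[name] for row in rows] for name in names}

start = time.perf_counter()
payload = encode(columns)   # in-memory model -> wire format
decoded = decode(payload)   # wire format -> receiver's in-memory model
elapsed = time.perf_counter() - start

print(f"{len(payload) / 1e6:.1f} MB on the wire, {elapsed:.3f}s in encode + decode")
```

Every value is touched twice on each side (once to reshape, once to serialize or parse), which is the kind of work that shows up as CPU time on large payloads.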
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Solution: Scaling out?

Pros
● Latency

Cons
● Band-Aid solution
  ○ Encode / decode still exists
  ○ Resource wastage
● Operational complexity

[Diagram label: (horizontally scaled)]
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Solution: Libraries? 


Pros
● Latency
● No encoding/decoding

Cons
● Becoming a monolith
● Multiple versions
● Access control
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The “composability theorem” 


● Must have composability
● Trading simplicity for performance


Performance Simplicity 
TechAtBloomberg.com
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Fragmentation is inevitable
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Problem: Lack of standardization 


● Every service defines its own schema and internal data model
● Lots of time spent designing schemas for new systems
● Lots of time spent connecting systems


[Diagram: request and response structures are encoded from one service's in-memory data model into a wire format, then decoded into a different in-memory data model on the other side.]
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Solution: Shared data model? 


Pros
● Some consistency across service schemas

Cons
● Increased schema complexity
  ○ Variant/union types
● In-memory format != on-wire format

[Diagram: Service A, Service B, and Service C exchange a shared schema fragment, each converting it to and from its own, different in-memory data model.]
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Solution: Organic localized standardization 


Pros
● Some reusability
● Overall less fragmentation

Cons
● Local optimum
● Performance problems unsolved


in-memory services 


data model 
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In Summary 


                  Monolith   Services only   Services +   Services + shared data
                                             libraries    models and converters
Performance       ✓          ✗               ✓            ✗
Isolation         ✗          ✓               partial      ✓
Maintainability   ✗          ✗               ✗            partial
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A theme emerges 


● A good solution to composability would:
  ○ Standardize both in-memory and on-wire formats
  ○ Reduce the cost of integration
  ○ Not add unnecessary complexity
  ○ Not add significant performance penalties
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In the meantime... 
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Offline analysis 


● Goal: use daily bond data for offline analysis
  ○ Compliance / troubleshooting
  ○ Product ideation

● Apache Parquet for analytics with Trino

● Apache Arrow as an intermediate format


© 2023 Bloomberg Finance L.P. All rights reserved.


What is Apache Arrow? 


[Diagram: a logical table (columns Col A, Col B, Col C; rows Row 1, Row 2, ...) shown next to its columnar memory format, in which the values of each column are stored contiguously.]


Tabular 
Columnar 

Fast data transfer 
IPC format = in-memory format
Cache friendly 
SIMD friendly 
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Example uses of Apache Arrow 


Data Transfer | Compute Engines | Visualization 
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New use-case 


● Python application for data validation / reconciliation

● Python app experiencing significant slowness
  ○ Culprit: ser/de was taking 90% of the time!
  ○ Brute force was not an option...
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Revisiting the status quo 


[Diagram: request and response structures are encoded from one service's in-memory data model into a wire format, then decoded into a different in-memory data model on the other side.]
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How about Arrow? 


[Diagram: the service's in-memory data model is sent as Arrow and consumed as a pd.DataFrame.]
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How did we do this? 


● Integrate Arrow IPC with middleware
  ○ e.g., Arrow Flight
● Got back almost all of the 90% ser/de time
● Just use pandas!


[Diagram: the service's in-memory data model is sent as Arrow and consumed as a pd.DataFrame.]
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Adoption: slow but steady


● Headline features (in Indices)
  ○ For us, data transport is no longer a bottleneck!
  ○ Interoperability with pandas

● Growing usage across Bloomberg
  ○ In-memory data format for custom analytics engines
  ○ Faster data transport across applications
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Can Apache Arrow solve our problems? 


● Standardizes both in-memory and on-wire formats ✓
● Does not add significant performance penalties ✓
● Reduces the cost of integration?
● Does not add unnecessary complexity?
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Taking it a step further 


in-memory 
data model
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bonds
SecId   Price
BBG001  2.00
BBG002  11.45
BBG003  6.39
BBG004  14.20
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Kinds of business logic 


● Validation
● Searching
● Enrichment
● Aggregation

fx_rates
Currency  Rate
USD       1.00
JPY       149.30
GBP       0.83

bonds
SecId   Currency
BBG001  USD
BBG002  GBP
BBG003  JPY
BBG004  USD
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How can we do it on Apache Arrow?
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In-Memory Analytics 
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Compute Engines 


● In-process library
● Declarative
● Query Optimization
● Computation
  ○ Vectorization vs. JIT
● Some are Arrow-native
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Operations supported by compute engines 


Filter 

Project 
Aggregate 

Join 

Sort 

Window functions 
Pivot tables 
Unnest 
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Representing our business logic 


● Can we represent our logic declaratively?

● Enrichment → Join / Project
● Aggregation → GroupBy / Agg
● Validation → Project / Filter
● Searching → Filter


SELECT
    b.*,
    (b.Price * fx.Rate) AS NormalizedPrice
FROM bonds b
LEFT JOIN fx_rates fx ON b.Currency == fx.Currency
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Adopting DuckDB 


● C++ API + query optimizer + vectorized engine
● Supports Arrow input/output
● Observed an order-of-magnitude performance improvement
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Achieving greater ۱۷ 
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Learnings 
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Economies of scale 


● Lower amortized cost to build new tooling
  ○ A standard is the foundation for generalizability

● e.g., Arrow integration with custom middleware
  ○ No need to encode and decode anymore
  ○ Every new user gets the tooling for free
  ○ Every new user encourages other callers to onboard as well

● Lots of newer open-source tooling built around Arrow
  ○ Driving developer costs down

● Performance gains open up additional design options
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Learning from others 


● Community-building helped us standardize and build the vision
● Couldn't have thought of all use-cases in isolation

● Alignment across teams is necessary to standardize effectively
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A shift in mindset 


● Arrow is not...
  ○ Just an in-memory format or...
  ○ Just an on-the-wire format

● It's both
  ○ Selectively using Arrow for specific problems diminishes its value
    ■ e.g., using Arrow for just data transport but not in-memory analytics
    ■ Still have to pay developer costs for implementing connector logic
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Thank you! 


We are hiring: 
https://www.bloomberg.com/careers 
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